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Abstract 

A number of applications require the computation of the trace of a matrix that is implicitly available through 
a function. A common example of a function is the inverse of a large, sparse matrix, which is the focus of 
this paper. When the evaluation of the function is expensive, the task is computationally challenging because 
the standard approach is based on a Monte Carlo method which converges slowly. We present a different 
approach that exploits the pattern correlation, if present, between the diagonal of the inverse of the matrix 
and the diagonal of some approximate inverse that can be computed inexpensively. We leverage various 
sampling and fitting techniques to tit the diagonal of the approximation to the diagonal of the inverse. 
Depending on the quality of the approximate inverse, our method may serve as a standalone kernel for 
providing a fast trace estimate with a small number of samples. Furthermore, the method can be used as a 
variance reduction method for Monte Carlo in some cases. This is decided dynamically by our algorithm. 
An extensive set of experiments with various technique combinations on several matrices from some real 
applications demonstrate the potential of our method. 
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1. Introduction 

Computing the trace of a matrix A that is given explicitly is a straightforward operation. However, for 
numerous applications we need to compute the trace of a matrix that is given implicitly by its action on a vec¬ 
tor x, i.e., Ax. Specifically, many applications are interested in computing the trace of a function of a matrix 
F(A). Examples include estimating parameters in image restoration using the generalized cross-validation 
approach (l|, exploring the inverse covariance matrix in uncertainty quantification dSl, computing observ¬ 
ables in lattice quantum chromodynamics (LQCD) I®, or counting triangles in large graphs @. The matrix 
A is large, and often sparse, so its action F(A)x is typically computed through iterative methods. Because it 
is challenging to compute F(A) explicitly, the Monte Carlo (MC) approach has become the standard method 
for computing the trace by averaging samples of the bilinear form x T F(A )x (6j(71. The main purpose of 
this paper is to develop practical numerical techniques to address the computation of the trace of the inverse 
of a large, sparse matrix. But our technique can also be adapted to other functions such as the trace of the 
logarithm (yielding the determinant) or the trace of the matrix exponential. 

For small size problems, computing A -1 through a dense or sparse LDU decomposition is the most 
efficient and accurate approach 0101. This works well for discretizations of differential operators in low 
dimensions but becomes intractable in high dimensional discretizations. For larger size problems, domain 
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decomposition and divide and conquer strategies are more tractable but still expensive iflOl . In many cases, 
however, a low accuracy approximation is sufficient. Numerous methods have been presented to address this 
need for estimating the trace of the inverse of symmetric positive definite matrices through Gaussian bilinear 
forms Elm, modified moments fl2llT3l . and Monte Carlo (MC) techniques lITl'l lI 14 15 12.6: Uhl. 

MC methods for computing the trace of a matrix are based on the structure of the Hutchinson method 
fTl . which iteratively computes an average of matrix bilinear forms with random vectors. Variants of MC 
estimators are mainly analyzed and compared based on the variance of one sample EQ/ZJl, which depends 
on the choice of the selected random vectors. For real matrices, choosing random vectors having each 
element ± 1 with equal probability is known to minimize variance over all other choices of random vectors 
Id [6j and therefore has been widely used in many applications. For complex matrices, the same result 
holds for vectors with ±1,±/ elements. In fl6|, Avron and Toledo analyze the quality of trace estimators 
through three different metrics such as trace variance, (£,S)-approximation of the trace, and the number of 
random bits for different choices of random vectors. In iflTI . Khorasani and Ascher improve the bounds of 
(£,S)-approximation for the Hutchinson, Gaussian and unit vector estimators. 

There has been a number of efforts to combine the Hutchinson method with well-designed vectors 
based on the structure of the matrix fl4l fT5l T8 , |4|| • In lfl5l . the authors use columns of the Hadamard 
matrix, rather than random vectors, to systematically capture certain diagonals of the matrix. Then, the MC 
iteration achieves the required accuracy by continuously annihilating more diagonals with more Hadamard 
vectors. However, the location of the nonzeros, or of the large elements of A -1 , often does not coincide 
with the diagonals annihilated by the Hadamard vectors. In lfl8l , graph coloring and probing vectors are 
used to identify and exploit special structures, such as bandedness or decaying properties in the elements 
of A 1 , to annihilate the error contribution from the largest elements. However, if the error for the chosen 
number of colors is large, all work has to be discarded and the probing procedure repeated until the accuracy 
is satisfied. In 0, the authors introduce hierarchical probing on lattices to avoid the previous problems 
and achieve the required accuracy in an incremental way. For all these approaches, the approximation error 
comes from non-zero, off-diagonal elements that have not been annihilated yet. A different MC method 
samples elements of the main diagonal of the A -1 to estimate the trace. However, its variance depends on 
the variance of the diagonal which could be much larger than the variance of the Hutchinson method |j6[. 
This paper focuses on this type of method with the extra assumption that an approximation to the main 
diagonal of A~ 1 is available. 

Our motivation for focusing only on the main diagonal is that the trace of A -1 is simply a summation 
of a discrete, 1-D signal of either the eigenvalues or the diagonal elements of A -1 . Although we cannot 
compute all the diagonal elements, we may have an approximation to the whole signal from the diagonal of 
an approximation of A~' (e.g., of a preconditioner). If the two diagonals have sufficiently correlated patterns, 
fitting methods can be used to refine the approximation both for the diagonal and the trace. Therefore, the 
proposed method may serve as a standalone kernel for providing a good trace estimate with a small number 
of samples. But it can also be viewed as a preprocessing method for stochastic variance reduction for MC 
in cases where the variance reduces sufficiently. This can be monitored dynamically by our method. 

We present several techniques that improve the robustness of our method and implement dynamic error 
monitoring capabilities. Our extensive experiments show that we typically obtain trace estimates with much 
better accuracy than other competing methods, and in some cases the variance is sufficiently reduced to 
allow for further improvements through an MC method. 

2. Preliminaries 

We denote by ||.|| the 2-norm of a vector or a matrix, by N the order of A, by Z an approximation of 
A, by D the diagonal elements of A -1 , by M the diagonal elements of Z ', by Tr(F(A)) the trace of the 
matrix F(A), and by extension, Tr(D) the sum of the elements of the vector D, by T ei (F(A )) the MC trace 
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estimator of F(A) using unit (orthocanonical) vectors, by Tz 2 ( F(A )) the MC trace estimator of F(A) using 
Rademacher vectors, by diag{.) the diagonal operator of a matrix, and by Var(.) the variance operator of a 
random variable or a vector. 


2. 1. Hutchinson trace estimator and unit vector estimator 

The standard MC method to estimate the trace of the matrix inverse is due to Hutchinson (l|. It estimates 
the Tr(A ') by averaging s bilinear forms with random vectors z j eZj = {z(i) = ±1 with probability 0.5}, 

T Z2(A~ 1 ) = -iz]A- 1 z j . (1) 

4 7=1 

The variance of this method is given by 

VariTz^A- 1 )) = -\\A~% - - £ ||A|| 2 . (2) 

5 5 (=i 

The variance of this trace estimator is proven to be minimum over all vectors with real entries fT]. The 
confidence interval of a MC method reduces as 0(g/Var(Tz 2 (A~ 1 ))) for the given matrix. 

The unit vector estimator uniformly samples s vectors from the orthocanonical basis {<?],...,e,\;} l6l . 


N 




c -J 

S 7=1 


(3) 


where ij are the random indices. The variance of the unit vector estimator is given by 

, N 2 

Var(T ei (A- l )) = —Var{D). (4) 

The variance of the Hutchinson method depends on the magnitude of the off-diagonal elements. It converges 
in one step for diagonal matrices and rapidly if A -1 is highly diagonal dominant. On the other hand, the 
variance of the unit vector estimator depends only on the variance of the diagonal elements. It converges in 
one step if the diagonal elements are all the same and rapidly if the diagonal elements are similar. Thus, the 
method of choice depends on the particular matrix. 


2.2. Reducing stochastic variance through matrix approximations 

Given an approximation Z « A, for which Z 1 and Tr(Z 1 ) are easily computable, we can decompose 

7Y(A -1 )=7Y(Z -1 ) + 7>(£), (5) 

where E = A 1 — Z 1 . We hope that by applying the MC methods on E instead on A -1 , the variance of 
the underlying trace estimator, in ([2]) or 0, can be reduced, thereby accelerating the convergence of MC. 
Among many ways to obtain a Z, we focus on the following two. 

The first approach is when Z 1 = (LI/) -1 , where the L, U matrices stem from an incomplete LU (ILU) 
factorization of A, one of the most commonly used preconditioners. If the ILU is sufficiently accurate, then 
M = diagiZ ') may be a good approximation to D. To obtain the vector M without computing the entire 
Z , we can use an algorithm described in lfl9ll . This algorithm requires the computation of only those 
entries Z (/ 1 for which L, j or Ujj / 0. If the L, U factors are sufficiently sparse or structured, this computation 
can be performed efficiently (see EOll for an example in the symmetric case). 

The second approach is a low rank approximation Z 1 = V / £ 1 1/ 7 } where £ is a diagonal matrix with 
the k <C IV smallest singular values of A, and U and V are the corresponding left and right singular vectors. 
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Eigenvalues and eigenvectors can be used instead but we do not consider it in this paper. This subspace can 
be obtained directly by an iterative eigensolver Il2lll22l as a preprocessing step. The cost of this procedure 
is relatively small since the singular space is not needed in high accuracy. Alternatively, the space can be 
approximated by methods such as eigCG ll23l or eigBiCG ll24ll while solving linear systems of equations 
during the MC method. This incremental approach adds only minimal overhead to the iterative linear solver 
but it cannot compute as many and as good quality singular vectors as the first approach. The quality of 
the approximation of A -1 by Z 1 depends on the separation of the computed singular space. Therefore, 
depending on the matrix, more singular triplets may be needed for a good low rank approximation. On 
the other hand, this space can also be used to deflate and thus accelerate subsequent linear systems. Once 
the singular space is computed, each diagonal element can be obtained with a single inner product of short 
vectors, M, = V (i, :) T U(i, :)/o,, and thus it is computationally inexpensive. Finally, the incremental SVD 
approach requires some special algorithmic attention during our algorithm, which will be pointed out later. 

A computationally inexpensive, albeit less accurate approach for computing an approximation M is 
based on variational bounds on the entries of A -1 11711251. Upper and lower bounds on the i-th diagonal entry 
A - 1 are derived inexpensively since they only depend on estimates of the smallest and largest algebraic 
eigenvalues, An ,Ajv, and the entries of A. The bounds apply to both symmetric and unsymmetric matrices. 
For the case of a real symmetric A, we have |25l , 

1 (Ay —Ag) 2 i 1 

+ A*(A N A U - s u ) >u ~ h A! (s u - AiAfi) ’ ^ ; 

where ,v ;/ - = A^A/y. However the bounds in ([h]) will not be shaip especially the upper bound llT2ll and 
the error in the approximation can be large. 

In general, unless Z 1 is highly accurate, we do not expect Tr(M ) to be close to Tr(A However, the 

patterns of M and D often show some correlation. We demonstrate this for two example matrices, delsq50 
and orsreg2205, in Figures [T] and [2] using IFU and SVD respectively. For matrix orsreg2205, both the IFU 
and SVD approaches return an approximate diagonal M which captures the pattern of D very well, with the 
M returned by IFU being slightly better than the one from SVD. For matrix delsq50, SVD clearly captures 
the pattern of D better than what IFU does. As in preconditioning for linear systems of equations, the 
appropriate approximation technique depends on the given matrix. 



(a) M does not captures pattern of D (b) M quite close to D and captures pattern 


Figure 1: The pattern correlation between the diagonals of A 1 and its approximation Z 1 computed by ILU(O) on matrices (a) 
delsq50 and (b) orsreg2205. delsq50 is created in MATLAB by delsq (numgrid (' S' ,50) ) . 
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(a) M not close to D but captures pattern (b) M not close to D but captures pattern 

Figure 2: The pattern correlation between the diagonals of 1 and its SVD approximation Z 1 computed from the 20 smallest sin- 
gular triplets of A on matrices (a) delsq50 and (b) orsreg2205. delsq50 is created in MATLAB by delsq (numgrid (' S' ,50) ) . 


2.3. Comparison of different MC methods and discussion on importance sampling 

Based on (|2]-[4|), we express the variance of the trace estimators Tz 2 (E) and T ej (E) as follows: 

Var{T z ffE)) = -\\E\\ 2 - - £ \\diag(E)\\ 2 , 

s s i=i 


(V) 


Var(T e fE )) 


N 2 
s 


Var(diag{E)). 


( 8 ) 


Figures [T] and [2] show there is potential for the variances of Tz, (E ) or T e . (E ) to be smaller than those of 
Tz 2 (A 1 ) and T ej (A 1 ). Flowever, for a given matrix, we must gauge which MC method would be better, 
and whether the variances need further improvement. 

The estimator T ej (E ) has the interesting property that if M = D+c, where c is a constant, then its variance 
in ([8]) is zero and we obtain the correct trace in one step. Although we cannot expect this in practice, it means 
that the shift observed between M and D in Figure [2(a)] should not affect the effectiveness of T e .(E ). 

On the other hand, T e .(E) fails to identify correlations of the form M = cD. For such cases, importance 
sampling is preferred, where M plays the role of a new distribution simulating the distribution of D. Assume 
that both D and M have been shifted by the same shift so that M,- > 0 if D,- > 0, / = 1..... /V. To transform M 
into a probability mass function, let G = Tr M. To obtain an estimator of the trace of D with importance 
sampling, we replace the uniform sampling of D, values with sampling with probability G, 11261 . Then, 
instead of ([3]), the importance sampling estimator is: 


Tis(D) 


N 


Ea, 


7=1 


1 

N 

G u 


Tr(M) * Djj 
S 


(9) 


When M = cD, the variance of Tjs(D) is zero and it finds the trace in one step. Flowever, it completely fails 
to identify shift correlations. In general D and M may have a more complex relationship that neither Tjs(D) 
or T ej (E) can capture. This motivates our idea to explore general fitting models to approximate D. 


3. Approximating the trace of a matrix inverse 

We seek to construct a function /, such that D ss /(M). Then we can decompose 

Tr(A- l ) = Tr(D-f(M)) + Tr(f(M)). (10) 
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Tr(f(M)) is trivially computed for a given /. A key difference between the approaches in ( fiT)| ) and ^ is 
that it is easier to find a fitting of the two vectors M and D if a strong pattern correlation exists between 
them than to fit all corresponding elements in matrices A~ l and Z 1 . If Tr(f(M)) is a good approximation 
to Tr(A~ l ) and its accuracy can be evaluated easily, we can directly use this quantity; Otherwise, we can 
apply the unit vector MC estimator to compute Tr(Efj t ) = Tr(D — provided that its variance 

N 2 

Var(T ei (E fit )) = —Var(E fit ) (11) 

is smaller than the variances in ([2]), Q, (|7|) and <[8]». 


Algorithm 1 Basic algorithm for approximating Tr(A 


Input: A E R NxN 

Output: Tr(A~ l ) estimation and Z E M' Vx,v 

Compute M = diag(Z '), where Z 1 is an approximation to A ] (Section |2.2[ ) 

Compute fitting sample S/, f , a set of k indices (Section 3.1) 

Solve linear systems D, = efA 1 e t , Mi E Sju (Section 2.2 1 

Obtain a fitting model f(M) ~ D by fitting f(M(Sfu)) to D(Sjj,) (Section |3.2[ ) 

Compute refined trace approximation T ei {Efu) using <[3]) 

Estimate the relative trace error and, if needed, the variances for different MC methods (Section [4]) 


The basic description of the proposed estimator is outlined in Algorithm [7] First, our method computes 
an approximation M of D using one of the methods discussed in the previous section. If that method is based 
on a preconditioner or a low rank approximation, that preconditioner Z 1 could also be used to speed up 
the solution of linear systems in step 3. Second, it finds a fitting sample Sf lt , a set of indices that should 


capture the important distribution characteristics of D. Since we have no information about D, Section 3.1 


discusses how to tackle this task by considering the distribution of M. Third, it computes the values of 
D(Sfu) by solving the corresponding linear systems using a preconditioned iterative solver. Since this is the 
computational bottleneck, the goal is to obtain good accuracy with far fewer fitting points than the number of 
vectors needed in MC. Fourth, it computes a fitting model that has sufficient predictive power to improve the 
diagonal approximation. This critical task is discussed in Section 3.2 Finally, since there are no a-posteriori 
bounds on the relative error of the trace, we use a combination of statistical approaches and heuristics to 
estimate it incrementally at every step. If the estimated error is not sufficiently small, our method can be 
followed by an MC method. For this reason, the algorithm also estimates dynamically the variances of the 
two different MC estimators (|7]l and ([8]) so that the one with the smallest variance is picked. The dynamic 
evaluation of error and variances is discussed in Section [4] 


3.1. Point Identification Algorithm 

We need to identify a set of indices Sf lt based on which we can compute a function / that fits f(M(Sfi t )) 
to D(Sfn ) so that \Tr(f(M)) — Tr(D)\ is minimized. It is helpful to view Tr(D) as an integral of some 
hypothetical one dimensional function which takes the values D{i) = D, on the discrete points i = I. 
Our goal is to approximate Jq D(x)dx with far fewer points than n. Because it is one dimensional, it may 
be surprising that the Monte Carlo approach can be advantageous than numerical integration. However, the 
effectiveness of any numerical quadrature rule relies on the smoothness of the function to be integrated, 
specifically on the magnitude of its higher order derivatives. In our case, our hypothetical function may 
have no smoothness or bounded derivatives. More practically, we deal with a set of discrete data points in 
D for which smoothness must be defined carefully. A typical definition of smoothness for discrete data is 
based on the Lipschitz continuity, i.e., |D, — Dj < c\i — j\ l l27ll . The lower the constant c, the smoother the 
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set of data. Hence, for a continuous function, the larger the magnitude of its first derivative, the less smooth 
its discretized points are. In the context of integrating an arbitrary D, the first order divided difference (i.e., 
the discretized first derivative) D, — Dj.\ may be arbitrarily large in magnitude and therefore numerical 
integration may not work any better than random averaging. 

Even for matrices that model physically smooth phenomena (e.g., in PDEs), the matrix may be given in 
an ordering that does not preserve physical locality. 

Consider a sorted permutation of the diagonal, D = sort(D). Obviously Tr(D ) = Tr(D), but D is mono¬ 
tonic and, based on the definition of Lipschitz discrete smoothness, it is maximally smooth, i.e., 

\D(i)-D(j)\<A\i-j\, (12) 

for the smallest possible A G M + among all permutations of Monotonicity implies that, in the absence 
of any additional information about the data, a simple trapezoidal rule minimizes the worst case integration 
error |f28l . In addition, if we are allowed to choose the integration points sequentially, based on the points 
computed so far, then a much better average case error can be obtained |29l [30:]. On the other hand, if 
bounds are known on the discrete smoothness of D, better worst case error bounds can be established. Since 
D, however, is not available, we turn to its approximation M. 

A close pattern correlation between M and D means that the elements of M should have a similar dis¬ 
tribution as those of D , or that M = sort(M) should be similar to D. Let us assume for the moment, that 
the index that sorts D to D sorts also M to M. In other words, we assume a complete correlation between 
the monotonic orderings of M and D even though their values may differ. Then, we can work on the surro¬ 
gate model M for which we can afford to identify the best quadrature points that yield the smallest error in 
Tr{M). These will be the ideal points over all permutations of M for computing the fitting function /. 

Specifically, we need to select indices that capture the important distribution changes in M. For exam¬ 
ple, identifying minimum and maximum elements of M sets the range of approximation for / and avoids 
extrapolation. We also look for entries in M that deviate highly from their neighbors (where the first order 
divided difference of M and hopefully of D has a large value). This strategy has three advantages. First, 
between such indices the data is smooth so the integral Tr{M) should be captured well by the trapezoidal 
rule. Second, and more important, we can obtain a more accurate fitting function / in a piecewise manner 
in intervals where D has similar behavior. Third, in this way we avoid picking points with the same value 
Mj = Mj which can create problems in the fitting function. 

The proposed index selection method is shown in Algorithm [2] Initially the set of sampled indices, 
SfU, includes the indices of the extrema of M, 1 and N. Then, for every interval (/', /), with i.j consecutive 
indices in S/u, we find the index t from /+ 1 to j — 1 that minimizes the trapezoidal rule error for computing 
Tr(M{i : j)): 

argmin (|(M(i)* ( i-j) - (M(i) -M{t)) * {i~t) ~ ( M(t)-M(j )) * (t - j) |). (13) 

i<t< j, te z 

The process continues until it reaches a maximum number of sampling points or until the maximum error 
over all intervals decreases by a fixed value, say 0.001. The value of this threshold depends on how close 
the patterns of M and D are correlated, a question we had postponed and we discuss next. 

The points selected by the previous algorithm minimize the error for computing Tr(M) with only S/i, 
points. How good are these points for fitting M to D and thus obtaining a small error in Tr(D)l Consider 
the M and D as discrete functions from M and assume they are bijective in their - ranges. Let 


1 It is easy to see that A = max,= 2 ^ \D(i) —D(i— 1)| = D(io) —D(io — 1) is the smallest value that satisfies 0 for the sorted 
array. Assume there is a different permutation D' that satisfies with 8 < A. If D 1 (k) = D(io), then 8 > |D'(*)-D'(*-1)|> 
D((q) —D(io — 1) = A, which contradicts the assumption. 
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J,G : {1— > {1be the two permutation functions that sort M to M and D to D, i.e., M(i) = 
M(J(i )) and D(i) = D{G(i)). Our method relies on the assumption that the distribution of M captures the 
distribution of D, or that there exists a smooth function / that can be approximated well with a low degree 
polynomial that fits D(i) = f{M(i)). For any i, there arc two indices m = G(i).k = J(i) with m / k in general 
that satisfy, 

Dim) = D(G(i )) = D(i) = /(M(/)) = ))) = f(M(k)). 


Our algorithm first picks an index i for M and derives the original index k = J(j) in M. Since the permutation 
G is unknown, we do not know the D(m) that should be matched with M(k). Thus, the algorithm makes 
the simplifying assumption that G = J and computes D(k ) instead. However, this corresponds to a different 
D(j), where G( j) = k. Therefore, using (12 1 , we can bound the error for this mismatch by 


\D{i) —D(j) \ < A\i — j\ = A|/ _1 (k) — G~ l {k)\. 


(14) 


This implies that as long as the permutations J and G are locally similar, or in other words they do not shuffle 
the same index of M and D too far from each other, the error is small and therefore the fitting should work 
well. On the other hand, if M is a random permutation of D, there is no good fitting, even though M = D. 
This implies that the traces of f{M) and D may be very close but not their individual elements. It also means 
that its variance of Ef lt may not always be small (see the examples in the following section as well as in 
Section |4~2| ). 

In practice, since we do not know the permutation G, it is possible that some flat area of M is associated 
with important changes in D. To alleviate the effect of a possible local pattern mismatch, we instead empir¬ 
ically insert the midpoint of the current largest interval every k samples (we choose k = 5 in lines 11-14). 
Returning to the choice of threshold in the algorithm, we see that going below a threshold helps the accu¬ 
racy of Tr{M) but not necessarily of Tr{D). The reason is that the greedy point selection strategy becomes 
less effective in determining the local mismatching patterns. Therefore, we terminate searching for a new 
index based on ( fl3| ) if the maximum error is less than 0.001. We found this threshold to be sufficient in our 
experiments. If the maximum required samples have not been generated, we continue by simply bisecting 
the largest intervals until mcixPts is reached (in lines 16-18). On exit, we compute Sju which maps S/u to 
the original unsorted ordering. 

A typical sampling result produced by our method is shown in Figure [3] The graph on the left shows 
the diagonals M and D plotted in their original order. The pattern correlation between them is clear, but it is 
less straightforward how to pick the fitting points. Figure [3(b)] shows both diagonals plotted with the order 
of indices that sorts M, i.e., M(J).D(J). The points picked by our algorithm based on M correspond to an 
almost monotonic sequence of points in D{J). The associated indices can then be used to identify a suitable 
set of entries in D to perform the numerical integration. 


3.2. Two Fitting Models 

Next we construct a fitting model that minimizes ||/(M(5/ ; - f )) —D(S// f )||. The fitting model must have 
sufficient predictive power with only a small number of points and it should avoid oscillating behavior since 
we work on the monotonic sequence M which we assume correlates well with D. Therefore, we consider a 
linear model and a piecewise polynomial model. 

The MC methods in ([5]) and ( [TO] ) can resolve the trace when D = M + c while importance sampling can 
resolve the trace when D = cM. To combine these, we first use a linear model, y = bM + c. We determine the 
parameters b,c by a least squares fitting, argmin^ ceR ||D(S/, f ) — ( bM(Sfi t ) + c)|| 2 - The linear model may be 
simple but avoids the large oscillations of higher degree polynomials, and in many cases it is quite effective 
in improving the accuracy of the trace estimation and reducing the variance of the diagonal elements of 
The linear fitting algorithm is described in Algorithm [3] Figure |4fa)] shows the fitting result on the example 
matrix of the previous section, in the original order of D. 







Algorithm 2 Point identification algorithm based on the trapezoidal rule 
Input: M £ M' v and maxPts 
Output: Sfif- desired sampling index set 
1: [M, J] = sort(M) 

2: numSamples = 1, initErr = tempErr = \Tr(M) — [M( 1) \ 

3: Add 1 ,N into S/i, and push interval (1. TV) and its tempErr in the queue Q including interval and error 

4: while numSamples < maxPts and tempErr > 0.001 * initErr do 

5: Pop interval (L. R) with largest error from Q 

6: for t =L+l : R — ldo 

7: Use ( [13] ) to find a bisecting index t 

8 : end for 

9: Add t into Sfu, numSamples = numSamples + 1 

10: Push intervals ( L. t ) and (t.R) each with their corresponding tempErr in Q 

11: if numSamples is a multiple of 5 then 

12: Insert one midpoint index into largest interval in S/u, numSamples = numSamples + 1 

13: Insert its corresponding left and right intervals in Q 

14: end if 

15: end while 

16: while numSamples < maxPts do 

17: Insert middle index of the largest interval into S, numSamples = numSamples + 1 

18: end while 

19: Return Sf, indices in original ordering, such that Spu = J(Sfu ) 



■°- 2 0 1 000 2000 3000 4000 5000 

Index of M 

(a) Distribution of elements of M and D 



Figure 3: A typical example to show our sampling strategy based on the pattern correlation of M and D. M is computed by 20 
singular vectors of matrix RDB5000. In the right figure, the magenta and green circles denote the sample points associated with the 
sampling indices Sfn in M and D. 


The linear model preserves the shape of M, and therefore relies exclusively on the quality of M. To take 
advantage of our premise that the distribution M approximates well the distribution of D, our next fitting 
model is the Piecewise Cubic Hermite Spline Interpolation (PCHIP). It was proposed in OTI to construct a 
visually pleasing monotone piecewise cubic interpolant to monotone data. The PCHIP interpolant is only 
affected locally by changes in the data and, most importantly, it preserves the shape of the data and respects 
monotonicity. Therefore, we work on M and the indices Sj lt = [1 = ,V| .si, - ■■ = /V] which are given 
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Algorithm 3 Linear least squares fitting model for approximating Tr(A 1 ). 
Input: A € R NxN 
Output: Tr(A~ l ) estimation: Tf 
1: Compute M using ILU or Eigendecompostion or SVD on A 
2: Call Algorithm [2] to compute sample set Sf lt . 

3: Find [b,c\ = argmin||D(5/,- f ) - (bM(S fit ) + c) || 2 . 

4: Compute trace approximation Tf = +c) 


in an order such that a = M(s\) < M(s-i) < ■ ■ ■ < M(sk ) = P is a partition of the interval [a, |3]. An index Sj 
corresponds to the index J x (s,-) in the original ordering of M, where J is from Algorithm bland J 1 denotes 
the inverse mapping of the sorted list. Thus, for each s t we compute D(J 1 (.v;)),/ = 1 and we use 

PCHIP to construct a piecewise cubic function such that, 

p(M( Si ))=D(J-\ Si )), i = (15) 

Notice that p(x) will be monotone in the subintervals where the fitting points D(J '(.v/)) are also monotone. 
Therefore, as long as M is close to D in the sense of ( [hf] ), integration of p(x) will be very accurate. 

The PCHIP model is given in Algorithm[4] The first two steps are the same as in Algorithm[3] In step 3, 
we apply the function unique to remove the duplicate elements of M(Sfit)) to produce a sequence of unique 
values as required by PCHIP. This yields a subset of the indices, S'fin which is mapped to original indices as 
I =./ 1 (S , j h ) to be used in PCHIP. 


Algorithm 4 PCHIP fitting model for approximating Tr(A 1 ). 

Input: A € R NxN 

Output: Tr(A~ l ) estimation: Tf 

% Steps 1 and 2 are the same as in Algorithm [3] 

3: Remove duplicates: S'f it = unique(M(5/; f )), / = J~ x {S'f it ) 

4: Apply PCHIP to fit p(M(I)) = D(I) and obtain a polynomial p(M) ~ D 
5: Compute trace approximation Tf = p(fH(i)) 



0.15 

0.1 


2000 3000 

Index of M 



(a) Linear LS model in original order 


(b) PCHIP model in sorted order 


Figure 4: Fitting results of the matrix RDB5000 in original order and sorted order with linear LS model and PCHIP model. 

Figure [4(b)] shows the result of fitting on the RDB5000 example of the previous figures. Table [T] com¬ 
pares the relative error for the trace, as well as the variances of D,D — M and D — p(M) when using the linear 
LS and PCHIP models on two test matrices, OLM5000 and KUU. In both cases, the two models can provide 
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a trace estimate of surprising relative accuracy of the order of le-2 with only 20 fitting points. In addition, 
for matrix OLM5000, the standard deviation of MC on D — p(M) is reduced by a factor of 10 compared to 
that of MC on D (a speedup of 100 in terms of samples). However, for matrix KUU, the standard deviation 
of MC on D — p(M ) does not improve much over MC on D. This reflects the discussion in Section 3.1 and 
suggests that our method can serve as a standalone kernel for giving a fast trace estimate. In addition, in 
some cases the method can reduce the variance significantly to accelerate a second stage MC. 


Table 1: Comparing the trace relative error and variances in |4|, |8]t and for matrices OLM5000 and KUU, using the linear 
LS and PCHIP models with 20 fitting points each. The traces of matrix OLM5000 and KUU are Tr{A~ x ) = —5.0848e + 02 and 
3.6187e + 03, respectively. 


Matrix 


OLM5000 



KUU 


Model 

M 

LS 

PCHIP 

M 

LS 

PCHIP 

Relative error 

9.6650e-01 

2.2320e-02 

1.4288e-02 

5.8316e-01 

1.2207e-02 

1.4469e-02 

Std(T ei (A - 1 )) 

1.1425e+02 

1.1425e+02 

1.1425e+02 

2.8731e+02 

2.873 le+02 

2.873 le+02 

Std(T ei {D-M)) 

1.1007e+02 

1.1007e+02 

1.1007e+02 

1.4999e+02 

1.4999e+02 

1.4999e+02 

Std(T ei (D-p(M))) 

- 

2.0332e+01 

1.7252e+01 

- 

1.6137e+02 

1.6289e+02 


4. Dynamic Evaluation of Variance and Relative Trace Error 

Since there are no a-posteriori bounds for the accuracy of our results, we develop methods that use 
the information from the solution of the linear systems to incrementally estimate the trace error and the 
variances of the resulting approximations. This approach is also useful when M is updated with more left 
and right eigenvectors or singular vectors obtained from the solution of additional linear systems. 

4.1. Dynamic Variance Evaluation 

To decide which MC method we should use after the fitting stage or even whether it is beneficial 
to use the fitting process for variance reduction, we monitor incrementally the variances Var(T ei (A~ 1 )), 
Var(T ej {Efi t )), Var(Tz } (A ')), and Var(Tz 2 (E)), with the aid of the cross-validation technique f32ll . 

Our training set is the fitting sample set D(Sfn), while our test set D(S mc ) is a small random set which is 
independent of the fitting sample set. If we want to combine our method with MC, eventually more samples 
need to be computed, and thus we can pre-compute a certain number of them as the test set D(S mc ). We 
have used the holdout method ll33l . a single train-and-test experiment for some data splitting strategy since 
the fitting sample set is fixed. 

To compute D(Sfi t ) or D(S mr ) , a linear system with multiple right hand sides is solved as follows: 

A ; - ( - — C/ Xi , Axj — €[. Vi E Sfh U S mc . (16) 

The computed column vectors jq can be used to estimate the Frobenius norm of both ,4 1 and E = A 1 — Z 1 
mm. Then Var(Tz^{A 1 )) and Var(T Zl (£)) can be estimated as follows: 

Var(T Z2 (A~ 1 )) & ^ £ (N | 2 - |Af), V/ G S f i t US mc , (17) 

S i 

2N W— , O r, 

Var(T Z2 (E)) ^ ^^{\\E(:,i)\\ 2 -\E(i,i)\ 2 ), VieS fit US mc , (18) 
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where £'(:,/) = £<?,. Simultaneously, based on the sampled diagonal elements A u 1 , we can also update the 
evaluation of Var(T ej (Efi t )), Var(T C) (E j) and Var(T ej {A~f). Here we only show the computation of the 
unbiased variance estimation for Var(T ej (Efn )) by: 

N 2 

Var(T ei (E fit )) « —Var(E fit (S mc )). (19) 

Note that Spa should not be used for estimating the variance of the unit vector MC estimator since these 
sample points are exact roots of the PCHIP function. 


Algorithm 5 Dynamic variance evaluation algorithm for estimating variances of different MC methods 
Input: A € R NxN 

Output: Tr(A~ l ) estimation and variances estimations of various MC methods 
1: Initialize maxPts, Sfit, S Inc 
2: if M is computed using ILU on A then 
3: Compute the approximation M 

4: end if 

5: Generate random index set S mc without replacement and compute MC samples D(S mc ) 

6: for i = 5 : 1 : maxPts do 

7: if M is computed using Eigendecompostion or SVD on A then 

8: Update M with 2 * i number of left and right eigenpairs or singular triplets 

9: end if 

10: Call Algorithm [2]to find more indices so that Spa has i fitting points 

11: Call Algorithms^] or |d] to update approximation of Tr(A 1 ) 

12: Estimate variances of different MC methods based on ( fT7] ), ( fT8j ) and ([T9]) 

13: end for 


We implement the dynamic variance evaluation scheme in Algorithm[5] In lines 2-4 and 7-9, the approx¬ 
imation M can be computed using an ILU factorization at the beginning of the procedure or be updated with 
increasing number of singular triplets or eigenpairs. Note that if M is obtained by a partial eigendecomposi- 
tion or SVD, the updated M is different in two consecutive steps, and thus Algorithm [2] will return a slightly 
different index set Spa which may not be incremental. This may not provide a consistent improvement of 
the relative trace error during the fitting progress. In line 10 of the algorithm, we force the points to be 
incremental between steps i and / + 1 as follows; we generate the entire set Sjf 1 and remove the indices 

that lie the closest to the previous index set Spj r The remaining index set is incorporated into S^j t . This 
simple scheme works quite well experimentally. 

Figure [5] shows how the actual and the estimated variances for three MC methods match for the test 
matrix RDB5000. In addition, the relative difference between different MC methods becomes clear after 
only a few points which facilitates not only the proper choice of MC method but also an early decision to 
stop if further fitting is not beneficial. In the numerical experiments section we show that these results are 
typical for matrices from a wide variety of applications. 


4.2. Monitoring Relative Trace Error 

As discussed in Section 3.1 and showed in Table [T] the variance of MC on the vector Epa may not be 
reduced if the sorting permutations of M and D are dissimilar. Even in such cases, however, our fitting 
method can provide a good trace estimation with only a small number of samples. We further investigate 
this by comparing the elements of D and p(M) with different orderings. Figure 6(a) shows the elements 
of p(M(J )) and D(J), i.e., with respect to the order of M. It illustrates that although p(M) captures the 
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io 4 


Matrix rdb5000: Variance Estimation 


Matrix rdb5000: Variance Estimation 




Number of fitting points 


(a) Dynamic variance estimation with 1LU 


(b) Dynamic variance estimation with SVD 


Figure 5: Comparing estimated variances and actual variances of unit vector on Efj, (denoted as Var(diag(E)_fit )) and Rademacher 
vector on A~* (denoted as Var(A~ 1 )) and E (denoted as Var(E)) of the matrix RDB5000 with ILU and SVD respectively. 




(a) Comparing D and p(M) in order of M 


(b) Comparing sorted D and sorted p(M) 


Figure 6: Compare D and p(M) of the matrix RDB5000 in different order where M is computed by ILU. 


pattern of D, the order of its elements does not correspond exactly to that of D; hence the small reduction 
in Var(T ej (Efi t )). However, Figure 6(b) reveals that the distributions of the sorted p(M) and the sorted D 
almost coincide; hence, the two integrals Tr(p(M )) and Tr(D) are very close. 

The main obstacle for using our method as a standalone kernel for trace estimation is that there is no 
known way to measure or bound the relative trace error. Resorting to the confidence interval computed by 
the variance of a MC estimator is pessimistic as our results show (see also (6JIT71). We note the similarity 
to the much smaller error obtained in the average case of integrating monotonic functions with adaptive 
bilinear forms versus the worst case known bounds |28ll29ll . 

Motivated by these previous research results, we show how to develop practical criteria to monitor the 
relative trace error in our Algorithms. Suppose at each step of the fitting process in Algorithm [5] we collect 
a sequence of trace estimates 7}, i e [1 . maxPts], Consider the trace estimations in two successive steps. 


\Tj-T m \ _ |(7) — Tr(D)) - (7} + i - 7Y(D))| _ \Ej-E i+ i\ < 2max(\Ej \, \E i+ i |) _ 2max{\Ej\, \E i+l \) 

17}+i | " 17}+! | “ 17}+! | - \T i+l \ ~ \Tr(D)\ 

( 20 ) 

As long as 7} converges with more fitting points, the relative difference of two successive trace estimations 
can serve as an approximation to the relative error. However, when the global pattern provided by M and 
p(M) is not fully matched to that of D, convergence of 7} stagnates until enough points have been added to 
resolve the various local patterns. To determine whether the current relative trace error estimation can be 
trusted, we present our second heuristic by considering the error bound of our fitting models. 
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When approximating f(M) with p(M), the PCHIP Hermite cubic splines with k points on the interval 
[a, p], the bound on the error E(M) = f(M) — p(M) is given by ll36l . 


\E(M)\ < 


1 

384 


* 4 ||/ (4) ll 


oo,[a,p] ) 


( 21 ) 


where h = and H/) * * * 4 )]]^ [ 0/ p- denotes the maximum value of the fourth derivative of / in the entire 
interval [a, P]. Since || f 1 ' 4 ’ ||„, ,[a,pi/384 is a constant, in two successive fitting steps we have, 


\E,{M)\ 

\E i+ i(M)\ 


h 4 

n i +1 


((P-«)AiQ 4 = ,*/+h4 

((P-Ot )/k m ) 4 [ ki ’ ’ 


( 22 ) 


where kj and ki +1 are the number of fitting points in two consecutive steps. We can use (22 1 to estimate 
the maximum possible improvement between two consecutive trace errors. If the i + 1 trace error estimate 
reduces over the z'-th estimate by a factor of more than (kj/kj + 1 ) 4 , we do not trust it. 


Algorithm 6 Dynamic relative trace error evaluation algorithm during the fitting process 
Input: TraceFit from Alg. [5] 

Output: TraceErr estimation 

% TraceErr(i) is defined only for i > 5 and we assume TraceErr(6) is well defined 
1: if i == 6 then 

2: TraceErr(6) = \T race Fit (6) — TraceFit(5)\/\TraceFit(5)\ 

3: end if 

4: if i > 6 then 

5: TempTraceErr = \TraceFit{i) — TraceFit(i — l)|/|7>aceFzY(z)| 

6: if {TempTraceErr/TraceErr(i — 1) > ((/— l)/z’) 4 ) then 

7: TraceErr(i ) = TempTraceErr 

8 : else 

9: TraceErr(i ) = TraceErr(i— 1) * ((/— l)//) 9 / 4 

10: end if 

11: end if 


One caveat is that (21 1 may not be tight since the same bound holds for each subinterval [M{sj), M(sj + \)]. 
This means that a high derivative in one subinterval might dominate the bound in ( |2Tj ) but should not affect 
the error in other intervals. Therefore, convergence might not be fully dictated by ( |22] ). In practice, we found 
that the improvement ratio is between 0{kj/ki + \) and 0{(kj/kj + \ ) 4 ). Therefore, if the current relative trace 

error estimate is determined not to be trusted, we may instead use |£) + i(M)| ~ (kj/kj + ] ) k \E\ (M) |, k E [1,4]. 

The choice of k depends on the quality of M. In our experiments, we use the geometric mean of the four 
rates, yielding k = 9/4. Recall that the corresponding ratio in MC is 0{y/ki/ki+\), which is much slower 
than the proposed trace estimation method. 

Algorithm[6]combines the two heuristics in ( |20| ) and ( |22] ) to dynamically monitor the relative trace error. 
It is called after step 12 of Algorithm[5] Figure [7] shows two examples of how our dynamic method provides 
reasonable estimates of the relative trace error. 


5. Numerical Experiments 

We run experiments on matrices that are sufficiently large to avoid misleading results due to sampling 

from small spaces but can still be inverted to obtain the exact trace error. We select matrices RDB5000 

and cfdl from the University of Florida sparse matrix collection f37l and generate three test matrices from 
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Matrix rdb5000: Relative Error of Trace 


10" 1 


Matrix rdb5000: Relative Error of Trace 




20 40 60 80 100 

Number of fitting samples 


(a) Monitoring relative trace error with ILU (b) Monitoring relative trace error with SVD 
Figure 7: Two examples of monitoring relative trace error of the matrix RDB5000 with ILU and SVD respectively. 


applications that appear in Q. The Heatflowl60 matrix is from the discretization of the linear heat flow 
problem using the simplest implicit finite difference method. The matrix Poissonl50 is from 5-point central 
difference discretization of the 2D Poisson’s equation on a square mesh. The VFH6 matrix is from the 
transverse vibration of a Vicsek fractal that is constructed self-similarly. We also use matrix matb5 which 
is a discretization of the Wilson Dirac operator on a 8 4 uniform lattice with 12 degrees of freedom at each 
node, using a mass near to critical. 

Table [2] lists these matrices along with some of their basic properties. All experiments are conducted 
using MATLAB 2013a. The number of fitting points increases as s = 5 : 100. The approximation M is com¬ 
puted by ILU with parameters type = ilutp and droptol = IE-2, or as a low rank approximation 
of 2s smallest singular vectors with accuracy IE-6 (twice the number of fitting points at each step), or by 
the bounds on the diagonal. 


Table 2: Basic information of the test matrices 


Matrix 

Order 

nnz(A) 

m 

Application 

RDB5000 

5000 

29600 

1.7E3 

computational fluid 

cfdl 

70656 

1825580 

1.8E7 

computational fluid 

Heatflowl60 

25600 

127360 

2.6E0 

linear heat flow 

Poisson 150 

22500 

111900 

1.3E4 

computational fluid 

VFH6 

15625 

46873 

7.2E1 

vicsek fractal 

matb5 

49152 

2359296 

8.2E4 

lattice QCD 


5.1. Effectiveness of the fitting models 

In Figure [8] we divide the diagonal elements of the matrix RDB5000 into three contiguous sets and 
zoom in the details. We see that despite a good M, the linear LS model cannot scale the entire M onto D. 
The more flexible piecewise approach of PCHIP results in a much better fit. 

In Figure [9j we look at three matrices with M generated using the SVD. The PCHIP model typically 
has smaller relative trace error than the LS model. We also see that as more fitting points are sampled, the 
relative trace error of both models decreases significantly at early stages and slowly after a certain point. 
This relates to the quality of M, not of the model. Typically M will approximate the global pattern of D and 
the two can be matched well with only a few fitting points. But if the local patterns of M and D differ, a large 
number of fitting points will be required. This can be seen in Figure 10 Using the permutation of M for each 
case, we plot M, D, and p{M) using 100 fitting points. For the Heatflowl60 matrix, p(M) approximates D 
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(a) Diagonal elements in (1,2000) (b) Diagonal elements in (2000,3000) (c) Diagonal elements in (3000,5000) 

Figure 8: Comparing the linear LS model with the PCHIP model on RDB5000 matrix with M from SVD. 





(a) Matrix Heatflow 160 (b) Matrix Poissonl50 (c) Matrix VFH6 

Figure 9: Comparing relative trace error between the LS model and the PCHIP model in three typical cases with SVD. 


well everywhere except for the small leftmost part of the plot, which allows the relative error to reach below 
1(T 4 before convergence slows down (Figure [ 9 ]). The behavior is similar for the Poissonl50. The issue is 
more pronounced on matrix VFH6, where M and p(M) capture the average location of D but completely 
miss the local pattern, which is reflected by a very slowly improving error in Figure [9] 

We mention that the irregularity of the relative trace errors in Figure prelates to the variability of suc¬ 
cessive updates of M and of the sampling indices, especially when M is of lower quality. 

Figure |TT| demonstrates that the PCF1IP model has smaller actual variance for MC on E/,, than the LS 
model. Therefore, we only consider the PCF1IP model in the rest of experiments. 

5.2. Comparison between the fitting model and different MC methods 

We address the question of whether the number of matrix inversions we spend on computing the fitting 
could have been used more efficiently in an MC method, specifically the Flutchinson method on A -1 and 
the Flutchinson method on E. In Table [3] we compare the relative trace error of the PCHIP model with 20 
fitting points against the relative errors of the two MC methods as computed explicitly from their respective 
standard deviations in ([2]) and (|7|), with s = 20, divided by the actual trace of D. 

When M approximates D sufficiently well, the trace from the fitted diagonal is better, and for the ILU 
approximations far better, than if we just use the Hutchinson method on A -1 (the first column of results). 
Although MC on E exploits the ILU or SVD approximation of the entire matrix (not just the diagonal that 
our method uses), we see that it does not always improve on MC on A -1 , and in some cases (cfdl with ILU) 
it is far worse. In contrast, our diagonal fitting typically improves on MC on E. The last column shows that 
even with an inexpensive diagonal approximation we obtain a similar or better error than MC on A -1 . The 
only exception is the matrix VFH6 where, as we saw earlier, M cannot capture the pattern of D. Even then, 
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Index of sorted M 



(a) Matrix Heatflow 160 


(b) Matrix Poissonl50 


Figure 10: Fitting results of three typical cases with the PCF1IP model and 



100 fitting points using SVD. 


its eiTor is close to the errors from the MC methods and, as we show next, the best method can be identified 
dynamically with only a small number of samples. 


Table 3: Relative trace error from our PCHIP model and from the MC method on A 1 and on E (computed explicitly as the standard 
deviation with s = 20 from |2]( and «[7j divided by the actual trace). 



ILU 

SVD 

Bounds 

Matrix 

Tz 2 (A- 1 ) 

PCHIP 

Tz 2 {E) 

PCHIP 

Tz 2 {E) 

PCHIP 

RDB5000 

5.2E-2 

8.1E-3 

4.8E-2 

4.1E-3 

1.2E-2 

5.3E-2 

cfdl 

1.3E-1 

2.8E-2 

8.2E+2 

8.8E-3 

1.8E-2 

2.6E-2 

Heatflow 160 

4.9E-4 

1.6E-7 

4.0E-5 

2.0E-4 

4.9E-4 

3.5E-4 

Poisson 150 

2.6E-2 

2.3E-3 

2.5E-2 

1.4E-3 

4.3E-3 

8.3E-3 

VFH6 

3.2E-3 

6.8E-5 

3.1E-5 

1.0E-2 

3.2E-3 

6.0E-2 


If the user requires better trace accuracy than our fitting technique provides, we explore the performance 
of the diagonal fitting as a variance reduction for MC with unit vectors. We compute the actual values of 
Var(T ej (Efit)), Vcir(T ej {A~ x )), Var(T ei (E)), Var(Tz 2 {A~ 1 )) and Var(Tz 2 (E)) for every step s = 5 : 100 and 
show results for three matrices in Figure [IT] Note that the low rank approximation uses 2s singular vectors. 
As before, for Heatflow 160, MC on Efit performs much better than other MC methods, achieving about 
two orders reduction in variance. For Poisson 150, MC on Efu is slightly better compared to the Hutchinson 
method on E. In contrast, for VFH6, MC with unit vectors do not perform well regardless of the diagonal. 


Matrix Hpatflnwlfif) - Variance 



Number of deflated eigenvectors or singular vectors 



Number of deflated eigenvectors or singular vectors 



Number of deflated eigenvectors or singular vectors 


(a) Matrix Heatflow 160 


(b) Matrix Poissonl50 


(c) Matrix VFH6 


Figure 11: Comparing actual variances of different MC methods in three typical cases with SVD. 
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(a) Matrix Heatflow 160 


(b) Matrix Poissonl50 


(c) Matrix VFH6 


Figure 12: Comparing estimated variances and actual variances of three MC methods with ILU. 




Matrix poisson 150: Variance Estimation 


-e-Estim. Var(diag(E)_fit) 
-A-Estim. Var(A -1 ) 

Estim. Var(E) 

■ -4- Actual Var(diag(E)_fit) 
-x- Actual Var(A _1 ) 

-a- Actual Var(E) 


0 20 40 60 80 100 

Number of fitting points 



(a) Matrix Fleatflow 160 


(b) Matrix Poissonl50 


(c) Matrix VFH6 


Figure 13: Comparing estimated variances and actual variances of three MC methods with SVD. 


5.3. Dynamic evaluation of variance and relative trace error 

The above results emphasize the importance of being able to assess quickly and accurately the relative 
differences between the variances of different methods as well as the trace error, so that we can decide 
whether to continue with fitting or which MC method to switch to. First we show the effectiveness of the 
dynamic variance evaluation algorithm for our fitting MC method on (D — p(M)) with unit vectors, and on 
A -1 and E with Rademacher vectors. Then, we evaluate our algorithm for estimating the relative trace error 
during the fitting process. 

Figure [12] compares the estimated variances with the actual variances of the three MC methods when 
increasing the number of fitting points from 5 to 100. The approximation M is computed by using ILU. 
We can see that the estimated values of Var(Tz 2 (A~ 1 )) and Var(Tz 2 (E)) converge to the actual variances 
after only a few sample points. The estimated value of Var(T ej (Efj t )) gets close to and captures the trend 
of the the actual variance as the fitting samples increase. Nevertheless, the relative differences between the 
variances of the various MC methods are apparent almost immediately. 

Figure [13] shows the same experiments when the approximation M is computed by SVD. Since M is 
updated each step, Var(Tz 2 (E)) and Var(T ej (Efi ,)) change accordingly. As with ILU, Var(Tz 2 (A ')) and 
Var(Tzn(E ) can be estimated very well in a few steps. Var(T ej (Efi t )) could be underestimated but the relative 
variance difference between these MC methods becomes clear when the fitting points increase beyond 20. 
Thus we are able to determine whether the fitting process is beneficial as a variance reduction preprocessing 
and which is the best MC method to proceed with for the trace estimation. 

Figure [14] compares the estimated relative trace error with the actual one in the cases of Figure [13] We 
observe that the estimation is accurate as the fitting samples increase, even for cases such as VFH6 where 
the fitting process is not as successful. Moreover, because our algorithm is based on upper bounds on the 
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(a) Matrix Heatflow 160 (b) Matrix Poissonl50 


(c) Matrix VFH6 


Figure 14: Comparing estimated relative trace error with actual relative trace error with SVD. 
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(a) PCHIP results 


(b) Relative trace error 


(c) Variances 


Figure 15: Fitting results, dynamic evaluation of relative trace error and variances with SVD on a large QCD matrix. In Figure 
|15(b)[ the green square denotes the relative trace error by applying hierarchical probing technique on deflated matrix E in |4|. 


error of a piecewise cubic polynomial, the actual relative trace error could be lower than predicted. 

5.4. A large QCD problem 

The trace estimator presented in this paper has the potential of improving a number of LQCD calcula¬ 
tions, where the trace of the Dirac matrix is related to an important property of QCD called spontaneous 
chiral symmetry breaking Il23l . The authors in |[4|] presented the method of hierarchical probing that achieves 
almost optimal variance reduction incrementally and inexpensively on regular lattices. 

As shown in Figure [T5J a), a low rank approximation with 200 singular vectors yields a good approxima¬ 
tion M and an excellent fit p{M). In Figure |T5jb), we see that the actual relative trace error decreases very 
fast to the order of le-4 with increasing number of fitting points and singular vectors, and can be moni¬ 
tored well by our dynamic trace estimation algorithm. These singular vectors can be approximated while 
solving the linear systems with eigCG. Interestingly, we can improve the relative trace error of hierarchical 
probing by two orders of magnitude. In addition, the variances of different MC methods can be estimated 
dynamically to allow us to continue with the best estimator if needed (Figure [I5jc)). 

6. Conclusion and future work 

A novel method has been presented to estimate the trace of the matrix inverse by exploiting the pattern 
correlation between the diagonal of the inverse of the matrix and some approximation. The key idea is 
to construct a good approximation M « D through eigenvectors or some preconditioner, sample important 
patterns of D by using the distribution of the elements of M, and use fitting techniques to obtain a better 
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approximation p(M) ~ D from where we obtain a trace estimate. The proposed method can provide a fast 
trace estimate with 2-3 digits relative accuracy given only a few samples while may or may not improve the 
variance of MC. When the variance is reduced sufficiently, our method can be also used as a diagonal estima¬ 
tor. We also propose an effective dynamic variance evaluation algorithm to determine the MC method with 
the smallest variance and a dynamic relative trace error estimation algorithm without any additional costs. 
We demonstrated the effectiveness of these methods through a set of experiments in some real applications. 
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